14 research outputs found

    Spoken data on corpus platforms : user-specific views and CLARIN concordancers

    Map Task Corpus of Heritage BCMS spoken by second-generation speakers in Switzerland

    In this paper, we present a corpus for heritage Bosnian/Croatian/Montenegrin/Serbian (BCMS) spoken in German-speaking Switzerland. The corpus consists of elicited conversations between 29 second-generation speakers originating from different regions of former Yugoslavia. In total, the corpus contains 30 turn-aligned transcripts with an average length of 6 min. It is enriched with extensive speakers’ metadata, annotations, and pre-calculated corpus counts. The corpus can be accessed through an interactive corpus platform that allows for browsing, querying, and filtering, but also for creating and sharing custom annotations. Principal user groups we address with this corpus are researchers of heritage BCMS, as well as students and teachers of BCMS living in diaspora. In addition to introducing the corpus platform and the workflows we adopted to create it, we also present a case study on BCMS spoken by a pair of siblings who participated in the map task, and discuss advantages and challenges of using this corpus platform for linguistic research

    Swiss-AL: platform for language data in applied sciences : on challenges in the field of language open research data

    Open Science is transforming the way researchers collect, process, analyze, and store empirical research data, particularly in the social sciences and humanities, where language data is crucial. This transformation process especially concerns developers and providers of large language corpora and manifests itself in at least three challenges when providing these corpora as Open Research Data (ORD). The challenges concern the heterogeneous practices that researchers apply when working with language data, the research data lifecycle, and legal and ethical aspects. In this paper, we present Swiss-AL, a language data platform developed in Switzerland that is being transformed into an Open Research Data Resource for Applied Sciences within the Swiss Open Science Strategy. The paper gives an overview of the data contained in Swiss-AL and the infrastructure that is used to process and analyze the data. Furthermore, it presents approaches to the three abovementioned challenges to language ORD

    Swiss-AL : language data platform for applied sciences

    Language data is used not only by linguists, but also in many other research disciplines. In the Swiss-AL project, we aim to develop practices for a diverse community interested in exploring the potential of language data while implementing FAIR data management principles and integrating them into the Swiss-AL workbench

    Converting raw transcripts into an annotated and turn-aligned TEI-XML corpus: the example of the Corpus of Serbian Forms of Address

    This paper describes the procedure of building a TEI-XML corpus of spoken Serbian starting from raw transcripts. The corpus consists of semi-structured interviews, which were gathered with the aim of investigating forms of address in Serbian. The interviews were thoroughly transcribed according to GAT transcription conventions. However, the transcription was carried out without tools that would check the validity of the GAT syntax or align the transcript with the audio recordings. In order to offer this resource to a broader audience, we resolved the inconsistencies in the original transcripts, normalised the semi-orthographic transcriptions and converted the corpus into a TEI format for transcriptions of speech. Further, we enriched the corpus by tagging and lemmatising the data. Lastly, we aligned the corpus turns to the corresponding audio segments by using a forced-alignment tool. In addition to presenting the main steps involved in converting the corpus to the XML format, this paper also discusses current challenges in the processing of spoken data, and the implications of data re-use regarding transcriptions of speech. This corpus can be used for studying Serbian from the perspective of interactional linguistics, for investigating morphosyntax, grammar, lexicon and phonetics of spoken Serbian, for studying disfluencies, as well as for testing models for automatic speech recognition and forced alignment. The corpus is freely available for research purposes

    Lexical Explorer: extending access to the Database for Spoken German for user-specific purposes

    This paper presents Lexical Explorer, a tool that allows interactive browsing and filtering of quantitative corpus information. It further describes how this tool can be used to support linguistic work on corpora of spoken German. By using Lexical Explorer, users can analyse quantitative corpus data by interacting with frequency tables and obtaining customised word profiles of word distribution across word form variation, co-occurrences and metadata. Interaction with corpus examples of particular corpus counts is also enabled. Lexical Explorer was developed as a prototype for user-specific corpus access and is aimed at researchers of German lexicon in spoken interaction. Although Lexical Explorer was developed on the basis of two small speech corpora of the German language, the underlying principle of this tool can be easily adapted to other corpora and other user groups. Moreover, the tool can be used to gain insights into the corpus structure as well as to study and verify corpus content in a transparent and user-friendly way

    Map task corpus of heritage BCMS 1.0

    The Map task corpus of heritage Bosnian/Croatian/Montenegrin/Serbian (BCMS) consists of elicited conversations (map tasks) by 29 second-generation BCMS speakers originating from different regions of former Yugoslavia and living in German-speaking Switzerland. The corpus is suited for researchers of heritage BCMS, as well as students and teachers of BCMS living in diaspora. The corpus contains 30 turn-aligned transcripts with an average length of 6 minutes. The texts are annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla) at the levels of lemmatisation, MULTEXT-East Version 6 morphosyntactic descriptions, and Universal Dependencies part-of-speech and morphological features. The corpus is enriched with corpus-specific annotations of truncations, elongations, stutters and code-switches. It is distributed in source TEI and derived vertical formats

    Lexical explorer

    The Lexical Explorer enables browsing and querying the corpus frequency data of FOLK (Research and Teaching Corpus of Spoken German; Schmidt 2014) and GeWiss (Spoken Academic Language; Fandrych, Meißner & Wallner 2017). The tool consists of tables developed for the purposes of the LeGeDe project (Möhrs et al. 2017). The figures are based on the DGD Release 2.10 (23.05.2018). For the comparison between the spoken language corpora and DeReKo, the DeReKo version 2016-II (30.09.2016) is used without the subcorpora Wikipedia data (articles, discussions) and Sprachliche Umbrüche (45/68) (cf. Kupietz & Keibel 2009). The tables are presented with DataTables (a plug-in for jQuery), and Ajax is used to retrieve the tables from the database asynchronously. Using the tool requires familiarity with the annotation of the corpora in the DGD